ABSTRACT

The adequacy of a test developed for statewide assessment of basic
mathematics skills was investigated. The test, comprised of multiple-choice
items reflecting a series of behavioral objectives, was compared with a more
extensive criterion measure generated from the same objectives by the
application of a strict item sampling model. In many instances, the two
instruments provided different classifications of students regarding mastery
of an objective. Many of the discrepancies were attributed to the small
number of items per objective and to the multiple-choice format of the
original test. Consequently, the use of criterion-referenced tests in
situations that severely limit test length and item format options was
questioned. In addition, the problems associated with the practice of
assuming content validity for criterion-referenced tests were discussed.
(Author)

The topic of criterion-referenced measurement has received considerable
attention during the past decade. Much of the initial controversy that was
generated over the relative merits of criterion-referenced and norm-referenced
measurement appears to have subsided. Today, most psychometricians seemingly
agree that criterion-referenced and norm-referenced measurement have
different purposes, and that each is appropriate under the circumstances for
which it was intended. Norm-referenced measures are generally more
appropriate in selection situations, while criterion-referenced instruments
facilitate classification decisions regarding an examinee's position relative
to a specified objective. The determining factor in the selection of a
measurement technique is the type of information required by the decision
maker.

The value of direct or absolute measures of student achievement relative
to an instructional objective has been demonstrated repeatedly for at least
two types of educational decision making. Instructional developers need highly
specific information about the attainment of educational objectives in order
to validate learning materials. Likewise, instructional managers need
detailed information about the status of each of their students for monitoring
the achievement of prescribed learning objectives. Each of these decision
makers has become dependent upon criterion-referenced measurement for acquiring
the necessary performance information.



Recently, the application of criterion-referenced measurement has
been extended to survey achievement testing situations such as state
assessment programs. The educational accountability movement has created
a need for specific information concerning the achievement of common
educational objectives in order to establish minimum educational standards. In
discussing the accountability issue, Hartnett (1971) described the movement
of education toward operational statements of educational objectives as a
basis for more precise measurement of educational effectiveness. Typical
examples of the movement are the "accountability act" and "state assessment"
programs adopted in Florida.

In establishing objective-based state assessment programs, the
objective-based measurement techniques that have proven so useful for making
instructional development and management decisions provided an obvious tool.
The logic of such an extension in the use of criterion-referenced instruments
cannot be argued; however, the decision to employ such instruments was made
without evidence of the suitability of criterion-referenced measurement for
large-scale testing situations. Utilization of the technique for survey
testing may present additional problems to the theoretical and methodological
problems faced by all users of criterion-referenced instruments. In
particular, the magnitude of data collected in survey testing practically
dictates the nature of usable instruments. For cost efficiency the responses
must be readily obtainable and machine scoreable; thus, a multiple-choice or
similar format for such instruments appears mandatory. Kriewall (1969)
suggested that the measurement error introduced by tests of reasonable length
with such a format severely limits the reliability of decisions concerning
the proficiency of individuals. The present paper addresses some of the
problems associated with the adaptation of criterion-referenced measurement
techniques to situations which require the collection of voluminous data,
such as survey achievement testing.



Method

The Florida State-Wide Eighth-Grade Test includes a section designed to
assess basic mathematics skills considered essential for everyday living.
The test was designed to measure nine skills which had been defined by a
set of behavioral objectives. For each objective, three multiple-choice
items were written to assess the skill identified by the objective.

For the present investigation, ten-item domain-referenced tests (Millman,
1973) were constructed to serve as criterion measures of a selected subset
of the nine objectives. Construction of the items followed an item form
approach (Hively et al., 1969; Osborn, 1968). Common wording was adopted
for each item in a given criterion measure, but unique numbers were randomly
generated for each item by a stratified sampling plan. In an effort to keep
the items as similar as possible to the items found in the Eighth-Grade Test,
numbers used in the criterion measure were restricted to a range consistent
with the numbers in the Eighth-Grade Test. The results reported herein were
obtained by administration of the two different measures of the following
objectives:

1. Cost Comparison: Given the prices of two articles, the student
   will determine the difference in cost.

2. Travel Time: Given the distance between two points and a rate of
   travel, the student will determine the required travel time.

3. Time Difference: Given two times of day, the student will determine
   the difference in time.

Cost Comparison:

Format: Two articles are priced at $____ and $____. What is the
        difference in cost of the two articles?

Parameters:

(1) Cost Difference (d):

    d = P1 - P2
    where $0.01 ≤ d ≤ $299.50 (@ $0.01 intervals)

(2) Cost of First Article (P1):

    P1 = $0.01 a
    where 50 ≤ a ≤ 30000

(3) Cost of Second Article (P2):

    P2 = $0.01 b
    where 50 ≤ b ≤ 30000
    and b ≠ a

Travel Time:

Format: A car travels d miles at an average speed of r miles per
        hour. How many hours does the trip take?

Parameters:

(1) Travel Time (t):

    t = d/r
    where 1/2 hr. ≤ t ≤ 30 hrs. (@ 1/2 hr. intervals)

(2) Distance Traveled (d):

    d = 10a
    where 2 ≤ a ≤ 30

(3) Speed or Rate of Travel (r):

    r = 10b
    where 1 ≤ b ≤ 8

Time Difference:

Format: If the time is t1, how long will it be until t2?

Parameters:

(1) Time Difference (T):

    T = t2 - t1
    where 1/4 hr. ≤ T ≤ 12 hrs.

(2) Initial Time (t1):

    t1 = 12:00 p.m. + a/4 hrs.
    where 0 ≤ a ≤ 95

(3) Final Time (t2):

    t2 = t1 + b/4 hrs.
    where 0 ≤ b ≤ 47
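
The Travel Time item form above can be read as a small item generator. The
following Python sketch is illustrative only: it enumerates every distance
and rate permitted by the stated parameter ranges, keeps the pairs whose
solution falls on the half-hour grid, and draws a ten-item test. The paper
does not describe its stratified sampling plan in detail, so the even split
between whole-hour and half-hour solutions, and the function names
travel_time_domain and sample_items, are assumptions made here.

```python
import random
from fractions import Fraction


def travel_time_domain():
    """Enumerate every (d, r, t) triple allowed by the Travel Time item form."""
    domain = []
    for a in range(2, 31):        # d = 10a miles, 2 <= a <= 30
        for b in range(1, 9):     # r = 10b miles per hour, 1 <= b <= 8
            t = Fraction(a, b)    # t = d / r, in hours
            # Keep only solutions between 1/2 and 30 hours on the half-hour grid.
            if Fraction(1, 2) <= t <= 30 and (2 * t).denominator == 1:
                domain.append((10 * a, 10 * b, t))
    return domain


def sample_items(n_items=10, seed=None):
    """Draw items from two assumed strata: whole-hour and half-hour solutions."""
    rng = random.Random(seed)
    domain = travel_time_domain()
    whole = [item for item in domain if item[2].denominator == 1]
    half = [item for item in domain if item[2].denominator == 2]
    n_whole = n_items // 2
    chosen = rng.sample(whole, n_whole) + rng.sample(half, n_items - n_whole)
    rng.shuffle(chosen)
    return [
        (f"A car travels {d} miles at an average speed of {r} miles per hour. "
         "How many hours does the trip take?", t)
        for d, r, t in chosen
    ]
```

Calling sample_items(10, seed=1) returns ten (question, answer) pairs; fixing
the seed makes a generated test form reproducible.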

The domain-referenced tests were developed to provide criteria for
determining the concurrent validity of the objective-based subscales in
the Florida Eighth-Grade Test. It was realized that any indication of
the validity of the subscales would be limited by the degree to which
the criterion measure provided valid information concerning mastery of
the objectives. Although the validity of the criterion measures could not
be guaranteed, it was assumed that the specification and use of explicit item
generation rules would at least facilitate the rendering of judgments about
their apparent content validity. To the extent that the item generation
rules reflect the original intent of the objectives, validity of the criterion
measures would be expected to exist.

It was assumed that for a given objective there exist two populations,
masters and non-masters. Based upon this assumption, a reliable test would
produce two distinct distributions of scores, one for each population.
Combining the observed scores of all examinees, i.e., both masters and
non-masters, would be expected to produce a bimodal distribution, with the
mastery group receiving scores equal to the maximum possible score less the
number of careless errors and the non-mastery group receiving scores of zero
plus the number of lucky guesses. Thus, the degree of overlap of the two
distributions could be taken as an indication of the amount of measurement
error in the scores.
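
The two-population assumption can be made concrete with a simple binomial
model. The sketch below is not the authors' procedure; it assumes that each
item is answered independently, that masters err only through careless
errors, and that non-masters succeed only through lucky guesses. The error
and guess rates are hypothetical parameters chosen by the user, and the
overlap measure (shared area of the two distributions) is one possible way
to quantify the "degree of overlap" mentioned above.

```python
from math import comb


def binomial_pmf(n_items, p_correct):
    """Probability of each total score 0..n_items for independent items."""
    return [comb(n_items, k) * p_correct**k * (1 - p_correct)**(n_items - k)
            for k in range(n_items + 1)]


def mixed_score_distribution(n_items, prop_masters, careless_rate, guess_rate):
    """Expected score distribution for a mixed group of masters and non-masters."""
    masters = binomial_pmf(n_items, 1 - careless_rate)
    non_masters = binomial_pmf(n_items, guess_rate)
    return [prop_masters * m + (1 - prop_masters) * nm
            for m, nm in zip(masters, non_masters)]


def overlap(n_items, careless_rate, guess_rate):
    """Shared area of the two score distributions (0 means fully separated)."""
    masters = binomial_pmf(n_items, 1 - careless_rate)
    non_masters = binomial_pmf(n_items, guess_rate)
    return sum(min(m, nm) for m, nm in zip(masters, non_masters))
```

With hypothetical rates of 0.10 for careless errors and 0.25 for lucky
guesses, overlap(3, 0.10, 0.25) is noticeably larger than
overlap(10, 0.10, 0.25), which is the sense in which a longer test separates
the two groups more cleanly.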

The criterion measures were administered to 151 eighth-grade students
who had taken the Florida State-Wide Eighth-Grade Test. Competency
classifications (i.e., mastery and non-mastery status) provided by the
Eighth-Grade Test were compared with classifications obtained with the
criterion measures of the same skills. Comparison of the two instruments was
intended to provide an indication of the feasibility of using survey tests
for making criterion-referenced interpretations.

Results

Table 1 presents (1) the proportion of students declared masters of
each objective according to their performance on the three-item subscales
of the Florida Eighth-Grade Test, (2) the proportion of students declared
masters of each objective according to their performance on the ten-item
criterion measures, (3) the proportion of cases in which examinees were given
the same classification by both measures, and (4) the product-moment
correlations between the scores produced by the two measures. The Florida
Eighth-Grade Program had specified a minimum standard for mastery
classification of two out of three items correct. Primarily for consistency,
a two-thirds standard, i.e., seven out of ten items correct, was also adopted
for the criterion measure. Other factors influencing selection of the cut-off
score for the criterion measures are discussed in the next section.
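
The four quantities reported in Table 1 can be computed from paired scores
as in the following Python sketch. It is a minimal illustration under stated
assumptions: the function names are hypothetical, the two-thirds standard is
taken to mean the smallest whole score that reaches two-thirds of the items
(2 of 3, 7 of 10), and any scores passed in would be the user's own data,
since the paper reports only summary values.

```python
def mastery_cutoff(n_items):
    """Smallest whole score that meets a two-thirds standard."""
    return -(-2 * n_items // 3)  # integer ceiling of 2n/3


def is_master(score, n_items):
    """Classify one examinee as master (True) or non-master (False)."""
    return score >= mastery_cutoff(n_items)


def proportion_masters(scores, n_items):
    """Proportion of examinees declared masters on one instrument."""
    return sum(is_master(s, n_items) for s in scores) / len(scores)


def classification_agreement(egt_scores, cm_scores, egt_items=3, cm_items=10):
    """Proportion of examinees given the same classification by both measures."""
    same = sum(is_master(e, egt_items) == is_master(c, cm_items)
               for e, c in zip(egt_scores, cm_scores))
    return same / len(egt_scores)


def pearson_r(x, y):
    """Product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5
```

Each function takes plain lists of integer scores, one entry per examinee,
with the two lists in the same examinee order.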

Figures 1-3 present the distributions of scores obtained on the ten-item
criterion measures for Objectives 1-3, respectively. In addition,
Figures 2 and 3 display the effect upon score distributions of broadening
the objectives through modification of the item generation rules.

In Figure 2, the solid line indicates the score distribution produced
when the domain of travel time items was restricted to the problem set
having fractional solutions of one-half hour (e.g., 1 1/2 hr., 7 1/2 hr.,
etc.). The score distribution represented by the broken line was produced by
stratified sampling of items from the domain having both integer solutions
and half-hour fractional solutions.

Table 1

Indications of Agreement Between the Florida Eighth-Grade Test
(E-GT) and Domain-Referenced Criterion Measures (CM) Concerning
Examinee Proficiency of Certain Basic Mathematics Skills

                          Masters    Masters    Agreement    Correlation,
Objective                 on E-GT    on CM      in Class.    E-GT & CM

1. Cost Difference          .91        .76         .84          .54
2. Travel Time              .85        .52         .65          .51
3. Time Difference          .74        .50         .68          .57


Figure 3 presents the score distribution for Objective 3 with a sample
of items randomly generated from a domain containing various combinations
of the following stratifications: (1) a.m. only, p.m. only, a.m. to p.m., and
p.m. to a.m.; (2) time differences of whole hours, half hours, and quarter
hours; and (3) initial times starting on the whole hour, half hour, and
quarter hour.

Discussion 

The instruments compared in the present investigation showed considerable
discrepancy in the classification of examinees as masters or non-masters
of the skills specified by the objectives. Both instruments had been judged
to possess content validity by virtue of their apparent consistency with
the pre-stated objectives. Undoubtedly, both of the tests were measuring the
corresponding skills to some extent. Problems arose, however, because
demonstration of the ability to perform a given objective often required
subordinate or concomitant skills in addition to the primary skill specified
by the objective. For example, the calculation of travel time, as specified
by Objective 2, required the ability to perform certain basic mathematical
computations in order to solve the verbal rate problems. As a result,
minor changes of the item generation rules to include problems with
whole-hour solutions as well as half-hour solutions produced quite noticeable
changes in the score distributions. Thus, the broken line in Figure 2
seemingly identifies two types of masters of the skill of calculating travel
time. One group of masters could solve travel time problems with either
whole-hour or half-hour solutions, while a lesser number of examinees could
solve travel time problems but only for problems with integer solutions. It
seems likely that the inclusion of problems involving other fractions that are
less common than one-half would tend to confound the results even further.

Although representing different objectives, Figures 1 and 3 further
demonstrate the effect of changing the item generation rules to broaden the
domain of items included. The bimodal characteristics of the score
distribution presented in Figure 1 suggest that parameters for Objective 1
define a rather narrow and homogeneous domain of items. In contrast, the
measure of Objective 3, which included numbers representing three
stratifications specified by the item generation rules, produced a more
rectangular score distribution. Apparently, a number of examinees were able
to calculate time differences but either had not mastered the concept of a.m.
and p.m. or had difficulty solving the problems that required the use of
certain fractional portions of an hour.

It should be remembered that the verbal content of the items in each problem
set used in the present investigation was held constant. Changes in the
vocabulary and format of the items would be expected to exert additional
influence upon the results. As the item domains increase in breadth and the
items become more and more heterogeneous, this type of confounding influence
tends to increase, and it becomes more and more difficult to make absolute
statements about what examinees can and cannot do.

It is recommended that developers of criterion-referenced instruments
devote considerable effort to activities that lead to increased precision of
the objectives. It is often possible to employ procedures used in task
analysis in the identification of capabilities that might be expected to
influence performance of the skill identified by an objective. In particular,
the test writer should look for prerequisite capabilities that appear to be
at a difficulty level that is relatively similar to that of the primary
skill. For example, one might have predicted that the ability to manipulate
mixed fractions would affect the performance of middle school students on
travel time problems with fractional solutions. At the same time, one would
not expect the inclusion of fractions in a set of wave mechanics problems to
influence the performance of college physics majors. In instances where the
potential influence of unspecified objectives is less obvious, it may be
necessary to try out the problems empirically in order to determine the extent
of confounding for a given group of examinees.

The confounding of test results arising from the measurement of two or
more skills simultaneously would be expected to increase as the item
generation rules introduce more and more heterogeneity into the problem set.
Since confounding increases the number of scores falling in the middle of the
possible range, the degree of overlap between the mastery and non-mastery
score distributions would also increase. Likewise, the number of scores at or
near any selected mastery cut-off score would increase, thus increasing the
likelihood of misclassifying an individual with such an observed score. In
this situation, classification results would be influenced to a much
greater degree by the selection of a cut-off score. For example, if on a
ten-item test no observed scores are found in the range from three to
seven, the selection of any score within that range as the cut-off score
will lead to the same classification of examinees. In the present
investigation a cutting score of seven was arbitrarily adopted. Figures 1-3
suggest, however, that such a selection was fairly appropriate for
minimizing the number of false positive classifications. If the consequences
of a false negative classification were more important, a lower cutting score
might be more suitable. In any event, for a homogeneous set of items such
as the ones used to measure Objective 1, such changes in the cutting score
adopted would have very little effect upon the results.
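
The dependence of classifications on the cut-off score can be checked
directly by counting masters at each candidate cut-off. The sketch below
uses a hypothetical helper, masters_by_cutoff, and is intended only as an
illustration; score lists supplied to it would be invented or locally
collected data, not results from the present study.

```python
def masters_by_cutoff(scores, cutoffs):
    """Number of examinees classified as masters at each candidate cut-off.

    An examinee is counted as a master when the observed score is at or
    above the cut-off.
    """
    return {c: sum(s >= c for s in scores) for c in cutoffs}


def classification_is_stable(scores, cutoffs):
    """True when every candidate cut-off yields the same number of masters."""
    return len(set(masters_by_cutoff(scores, cutoffs).values())) == 1
```

For a bimodal score list with no observed scores from three through seven,
classification_is_stable(scores, range(3, 8)) returns True; for a
rectangular distribution covering every score from zero to ten it returns
False, because each step in the cut-off moves some examinees across the
boundary.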

Reliability, in the sense of replicability of competency classifications
relative to a given objective, would seem to be high for a bimodally
distributed set of scores. In effect, each item from a homogeneous domain
serves as a replication of the measurement of an individual's proficiency
relative to a given objective. Naturally, such homogeneous measures are
highly consistent internally, and as long as both masters and non-masters
of a given objective are included in the test sample, KR-20 estimates of
reliability will be high.
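
For reference, the KR-20 estimate mentioned above is
k/(k - 1) * (1 - sum(p_i * q_i) / variance of total scores), where k is the
number of items, p_i is the proportion of examinees passing item i, and
q_i = 1 - p_i. The Python sketch below computes it from a matrix of 0/1 item
scores. It is an illustration only: the function name kr20 is hypothetical,
the population form of the variance (dividing by the number of examinees) is
assumed, and the paper does not report how its own estimates were computed.

```python
def kr20(responses):
    """KR-20 reliability for a list of examinee rows of 0/1 item scores."""
    n_examinees = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_examinees
    # Population variance of total scores (divide by N, not N - 1).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_examinees
    if var_total == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n_examinees  # item difficulty
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)
```

For a perfectly bimodal sample in which half the examinees pass every item
and half fail every item, the estimate equals 1. In a sample made up only of
masters, the total-score variance shrinks toward the sum of the item
variances and the estimate falls, which is why the mix of masters and
non-masters in the sample matters.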

Much of the discrepancy in the classification of examinees that resulted
from comparing performance on the two different measures can probably be
attributed to the measurement error accompanying the subscales of the Florida
Eighth-Grade Test. Primary factors contributing to this measurement error
would be the use of three-item tests for each objective and the use of a
multiple-choice format. Although the exact effect of the multiple-choice
format upon the measurement of behavioral objectives cannot be determined,
it seems likely that a test with such a format would require more items in
order to yield a reliable measurement than would a test with a free response
format.

Even with free response items, a number of factors appear to have an
influence upon the number of items required to provide a reliable measure
of a specified objective. First, as the objective becomes broader and the
test becomes more heterogeneous, the length of the test must be increased
to maintain measurement precision. Figure 1 suggests that even for highly
homogeneous tests, four or five items may be necessary to minimize
classification errors. Second, the number of items required to measure a given
objective would also be influenced by the importance of the resulting
decisions. For highly important decisions, where the consequences of
misclassification are serious, the number of items would need to be increased.
Finally, with the free response format, particularly in the measurement of
mathematics objectives, test length may be related to the relative
seriousness of Type I and Type II errors. For free response mathematics tests,
the likelihood of careless errors would be far greater than the likelihood
of lucky guesses. Thus, if false negatives are more serious than false
positives, test length may need to be increased.
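
The relation between test length, item format, and the two kinds of
classification error can be sketched with binomial probabilities. The
following Python fragment is an illustration under assumptions that do not
come from the paper: items are treated as independent, a non-master answers
correctly only by guessing at a fixed rate, and a master errs only through
careless errors at a fixed rate. The rate of 0.25 used in the note below
corresponds to an assumed four-option multiple-choice item; the number of
options on the Florida items is not stated in the text.

```python
from math import comb


def prob_at_least(n_items, cutoff, p_correct):
    """P(score >= cutoff) when each item is correct with probability p_correct."""
    return sum(comb(n_items, k) * p_correct**k * (1 - p_correct)**(n_items - k)
               for k in range(cutoff, n_items + 1))


def false_positive_rate(n_items, cutoff, guess_rate):
    """Chance that a non-master reaches the cut-off through lucky guesses."""
    return prob_at_least(n_items, cutoff, guess_rate)


def false_negative_rate(n_items, cutoff, careless_rate):
    """Chance that a master falls below the cut-off through careless errors."""
    return 1 - prob_at_least(n_items, cutoff, 1 - careless_rate)
```

Under these assumptions, false_positive_rate(3, 2, 0.25) is 0.15625, so a
pure guesser would be declared a master on a three-item subscale roughly one
time in six, while false_positive_rate(10, 7, 0.25) is below 0.01.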

The adoption of criterion-referenced instruments for large-scale
testing situations greatly increases the need for adequate theories and
methodologies relating to criterion-referenced measurement. In classroom
management situations, test quality is seldom critical. Other information
sources provide a constant check on the criterion-referenced data. Since
instructional management is a continuously ongoing process and most
classroom decisions are of a temporary nature, decisions based upon invalid or
inaccurate data can be readily modified at any time. On the other hand,
survey testing often represents a single data collection effort and
constitutes the sole information source for the decision maker. If the results
of such testing are likely to have far-reaching effects upon the examinees
or upon their schools or teachers, the integrity of the data is critical.